Abstract

This paper is concerned with nearest neighbor search in distributional
semantic models. A normal nearest neighbor search only returns a ranked
list of neighbors, with no information about the structure or topology of
the local neighborhood. This is a potentially serious shortcoming of this
mode of querying a distributional semantic model, since a ranked list of
neighbors may conflate several different senses. We argue that the topology
of neighborhoods in semantic space provides important information about the
different senses of terms, and that such topological structures can be used
for word-sense induction. We also argue that the topology of the
neighborhoods in semantic space can be used to determine the semantic
horizon of a point, which we define as the set of neighbors that have a
direct connection to the point. We introduce relative neighborhood graphs
as a method to uncover the topological properties of neighborhoods in
semantic models. We also provide examples of relative neighborhood graphs
for three well-known semantic models: the PMI model, the GloVe model, and
the skipgram model.

1 Introduction 

Nearest neighbor search is a fundamental operation in data mining, in which
we are interested in finding the closest points to some given reference
point. Formally, if we have a reference point r and a set of other points P
in a metric space M with some distance function d (or similarity function
s), the nearest neighbor search task is to find the point p ∈ P that
minimizes d(p, r). In k-nearest neighbor search (k-NN), we want to find the
k closest points to some given reference point. Nearest neighbor search is
a well-studied task, and in particular the complexity of the task (a linear
search has a running time of O(Ni), where N is the cardinality of P and i
the complexity of the distance function d) has generated a lot of research
(Bentley, 1975; Arya et al., 1998; Indyk and Motwani, 1998).
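As an illustration of the linear search described above, the following is a minimal sketch in Python. The Euclidean distance, the toy points, and the function names `euclidean` and `knn` are assumptions made for the example and are not taken from the paper.

```python
import heapq
import math


def euclidean(p, q):
    """Distance function d; its cost grows with the dimensionality i."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))


def knn(points, r, k, d=euclidean):
    """Return the k points closest to r, nearest first, by scanning all N points."""
    return heapq.nsmallest(k, points, key=lambda p: d(p, r))


# Toy data (hypothetical): four points in the plane and a reference point.
points = [(0.0, 1.0), (2.0, 2.0), (0.5, 0.0), (5.0, 5.0)]
nearest = knn(points, r=(0.0, 0.0), k=2)
```

The result is only a ranked list: nothing in `nearest` says how the returned points relate to each other, which is the limitation discussed in the remainder of this section.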
Figure 1: Examples of neighborhoods with a clear
branching structure (a) and without (b).


The problem we are concerned with in this paper is not the complexity of
nearest neighbor search, but the question of how to identify the internal
structure of neighborhoods defined by the nearest neighbors. The problem
with a normal k nearest neighbor search is that the result (a sorted list
of the k nearest neighbors) does not say anything about the internal
structure of the neighborhood. Consider spaces a and b in Figure 1. A
nearest neighbor search for the reference point r in these two spaces will
generate the exact same result, despite the fact that the neighborhoods are
very different with regard to their internal structure (the neighbors in
space a display a distinct branching structure, whereas the neighbors in
space b are distributed evenly across the space). Such structural
properties of nearest neighborhoods can be very important.

We propose to use relative neighborhood graphs in order to identify the
structural properties of nearest neighborhoods. The use of relative
neighborhood graphs also provides a partial solution to the problem of
finding a relevant k for a given reference point. Again, consider the
neighborhoods in Figure 1. The choice of k = 10 in these spaces is
completely arbitrary, and could be argued to be erroneous, since there are
in fact 12 relevant neighbors in both spaces. Another way to approach the
nearest neighbor search task is to use a radius around the reference point,
so that only points within that radius are considered to be neighbors.
However, setting a global threshold t seems just as arbitrary as setting a
global number of neighbors k. Ideally, t or k should be determined based on
the structural properties of the nearest neighborhood around the reference
point. We refer to this factor as the horizon with respect to the reference
point.

Although a relative neighborhood graph is a general method for defining and
structuring nearest neighborhoods, we are in this paper primarily
interested in its application to nearest neighbor searches in
distributional semantic models that collect and represent co-occurrence
statistics in high-dimensional vector spaces. The main operation in such
models is nearest neighbor search, which is used for finding terms that
have similar co-occurrence behavior. However, a ranked list of neighbors
does not provide any information on whether the neighbors belong to several
different senses. This problem has been misinterpreted as a shortcoming of
the distributional representation (Erk and Padó, 2010). As we will
demonstrate in this paper, this is not a shortcoming of the distributional
representation, but of the mode of querying the distributional model. We
argue that information about the different usages (i.e. senses) of a term
is encoded in the structural properties of the nearest neighborhoods, and
that a relative neighborhood graph is a viable tool for uncovering such
structural properties.


negatively collinear — to 1 — positively collinear 
— taking the value 0 if the vectors are orthogonal. 
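The range of the cosine measure can be checked numerically. The sketch below is a plain Python illustration; the function name `cosine` and the example vectors are assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between u and v; independent of vector length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # positively collinear
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # orthogonal
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # negatively collinear
```

Up to floating-point rounding, the three values are 1, 0, and -1. Scaling either argument by a positive constant leaves the score unchanged, which is what normalizing for vector length means here.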

We can thus use a distributional semantic model to quantify the similarity
between any given terms. If the set of given terms is the entire set of
terms in our model, we are in effect performing a nearest neighbor search.
This is a particularly important operation in distributional semantics,
since it answers the question "which other terms are similar to this one?",
and this is a central question in semantics; lexica and thesauri are built
with the main purpose of answering this question, and a nearest neighbor
search in a distributional semantic model could therefore be seen as a
compilation step in a distributional lexicon.

The result of a nearest neighbor search in a distributional semantic model
is often presented as a list of (the top k) neighbors, sorted by descending
similarity with the target term. Table 1 illustrates typical sorted nearest
neighbor lists produced with three different kinds of distributional
semantic models: a vanilla-flavored model based on (positive) Pointwise
Mutual Information (PMI),¹ the skip-gram model (Mikolov et al., 2013), and
GloVe (Pennington et al., 2014).


2 Distributional Semantics and 
Nearest Neighbor Search 

Collecting and comparing co-occurrence statistics for terms in language has
become a standard approach for computational semantics, and is now commonly
referred to as distributional semantics. There are many different types of
models that can be used for this purpose, but their common objective is to
represent terms as vectors that record (some function of) their
distributional properties. The standard approach for generating such
vectors is to collect distributional statistics in a co-occurrence matrix
that records co-occurrence counts between terms and contexts. The
co-occurrence matrix is then subject to various types of transformations,
ranging from the application of simple frequency filters or association
measures like pointwise mutual information, to matrix factorization or
regression models. The resulting representations are referred to as
distributional vectors, and are typically dense with a dimensionality that
is considerably lower than that of the original co-occurrence matrix.

The distributional vectors are used to compute similarity between terms.
There are many ways to compute similarity or distance between points in
vector space; the cosine of the angle between vectors is often the
preferred metric in distributional semantics because of its simplicity and
because it normalizes for vector length. Computing the similarity between
distributional vectors using the cosine measure gives us a score ranging
from -1 —


Table 1: Sorted list of the nearest neighbors to "suit" in different
distributional models. Different fonts represent different meanings of
"suit."

PMI          GloVe          skipgram
suits        suits          suits
dress        lawsuit        lawsuit
jacket       filed          countersuit
wearing      case           classaction
hat          wearing        doublebreasted
trousers     claiming       skintight
costume      lawsuits       necktie
shirt        alleging       wetsuit
pants        alleges        crossbone
lawsuit      classaction    lawsuits


In the vanilla-flavored model, the distributional vector of a word is given
by its (positive) PMI with regard to all other words that have occurred
within a context window of 2 words to the left and 2 words to the right.
That is, a vector for a word a corresponds to the information an
observation of a gives when predicting surrounding words. Positive PMI
means that negative values are discarded, and only positive PMI values are
retained. The cosine similarity of two distributional vectors thus gives a
measure of how similar the information gained by observing the
corresponding words is. As a way to speed up later computations we apply a
Gaussian random projection to reduce the dimensionality down to 2000.

¹ For observations a and b, $\mathrm{pmi}(a, b) = \log \frac{p(a, b)}{p(a)\,p(b)}$.
The probabilities are often replaced in distributional semantic models by
co-occurrence counts of a and b and their respective frequency counts.
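The positive PMI weighting and the Gaussian random projection can be sketched as follows. This is a minimal illustration in plain Python under stated assumptions: the count matrix is a made-up toy example rather than counts from a 2+2 context window, the projection scaling is one common convention, and the function names `ppmi` and `gaussian_projection` are hypothetical rather than the authors' implementation.

```python
import math
import random


def ppmi(counts):
    """Positive PMI for a (words x contexts) matrix of co-occurrence counts."""
    total = sum(sum(row) for row in counts)
    word_totals = [sum(row) for row in counts]
    context_totals = [sum(col) for col in zip(*counts)]
    vectors = []
    for i, row in enumerate(counts):
        vector = []
        for j, n in enumerate(row):
            if n == 0:
                vector.append(0.0)  # unseen pairs contribute nothing
                continue
            # log p(a,b) / (p(a) p(b)), with probabilities estimated from counts
            pmi = math.log(n * total / (word_totals[i] * context_totals[j]))
            vector.append(max(pmi, 0.0))  # discard negative values
        vectors.append(vector)
    return vectors


def gaussian_projection(vectors, dim, seed=0):
    """Project each vector to `dim` dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    n_features = len(vectors[0])
    scale = 1.0 / math.sqrt(dim)
    projection = [[rng.gauss(0.0, scale) for _ in range(dim)]
                  for _ in range(n_features)]
    return [[sum(v[f] * projection[f][k] for f in range(n_features))
             for k in range(dim)]
            for v in vectors]


# Toy counts (hypothetical): 3 words x 3 context words.
counts = [[4, 0, 1],
          [0, 3, 1],
          [1, 1, 0]]
vectors = ppmi(counts)
reduced = gaussian_projection(vectors, dim=2)
```

In the setting described above, the projection would map vocabulary-sized vectors down to 2000 dimensions; the toy example only reduces three dimensions to two.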

GloVe, on the other hand, tries to find distributional vectors such that
their dot product approximates their log probability of co-occurring. This
is motivated by the fact that the logarithm of a ratio equals the
difference of logarithms, which makes the vector differences meaningful in
that they encode (logarithms of) ratios of probabilities. Reframed as a
weighted least squares problem, where rare co-occurrences are weighted
down, it can be solved by standard methods. The performance is comparable
to the skip-gram model, and it performs particularly well on word-analogy
tasks (Pennington et al., 2014).


The objective of the skipgram model is to maximize the probability of
observing all context-word pairs, given that the probability of one
observation of a word c in the context of t is given by
$\frac{\exp(w'_c \cdot w_t)}{\sum_{c' \in V} \exp(w'_{c'} \cdot w_t)}$,
where $w_a$ and $w'_a$ denote the "input" and "output" vectors of the word
a, and V is the vocabulary. The embeddings are found using stochastic
gradient descent and hierarchical softmax combined with negative sampling
and subsampling. Exactly how these methods compose is still unclear, and
puts into question what the underlying model actually is (Levy and
Goldberg, 2014). Regardless, the skip-gram model delivers state-of-the-art
performance on a multitude of tasks, with very low-dimensional vectors
(Baroni et al., 2014).


Table 1 lists the 10 nearest neighbors to "suit" in three different
distributional semantic models using the entire Wikipedia as data.² As can
be expected, there are both similarities and dissimilarities between these
neighborhoods; "suits" and "lawsuit" occur among the 10 nearest neighbors
to "suit" in all three models, whereas other terms are specific to one
particular model. Yet all three models feature neighbors of "suit" that
represent different senses: "suit" is not related to "jacket" in the same
way it is related to "lawsuit".

It has been argued that distributional semantic models that represent terms
by a single vector cannot adequately handle polysemy, since they conflate
several different usage patterns in one and the same vector (Erk and Padó,
2010; Véronis, 2004). Examples like the one above are often cited as
evidence. We argue that this critique is unfounded and misinformed, and
that it is the mode of querying the distributional semantic model that can
be susceptible to problems with polysemy. As the above example
demonstrates, querying distributional semantic models by k-NN conflates
different usages of terms. The reason for this seems quite obvious: simply
ranking the nearest neighbors by similarity (or distance) ignores any local
structures of the neighborhood. If "suit" has as neighbors both "dress" and
"lawsuit", which represent two distinct types of usages of "suit", there
will be a structural distinction in the neighborhood of "suit" between
these different neighbors, since they will be mutually unrelated (i.e.
there is a similarity between "suit" and "dress" and between "suit" and
"lawsuit", but not between "dress" and "lawsuit").

² We use a Wikipedia dump from 2010 as data in this and following
experiments.

k-NN also gives rise to another problem related to polysemy in
distributional semantic models. The problem is that the most frequent
senses will populate the top of the nearest neighbor list, while the less
frequent senses will not appear until further down the list, and if we set
a too restrictive k, we will only see neighbors relating to the most
frequent sense. Consider, for example, a term such as "suit", which, as we
have seen above, may appear in (at least) two different senses: in usages
related to law and in usages related to clothes (or garment). The
distributional vector can be thought of as a sum

$v_{suit} = f_{suit \mid law}\, v_{suit \mid law} + f_{suit \mid clothes}\, v_{suit \mid clothes}$

where $v_{suit \mid law}$ is an idealized notion of the true distributional
vector of "suit" in the law-sense, and $f_{suit \mid law}$ is the relative
frequency of this sense.³ From there one can easily argue that a similarity
such as $s(v_{suit}, v_{garment})$ is actually a weighted composite of the
similarities $s(v_{suit \mid law}, v_{garment})$ and
$s(v_{suit \mid clothes}, v_{garment})$.⁴ If "suit" occurs predominantly in
the law-sense in our corpus, the k-NN neighborhood of "suit" will be
dominated by words pertaining to its law-sense, while the less frequent
senses might not be present at all. A misguided k may thus obscure any
other, less frequent, senses of a term.
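The effect of sense frequency on the composite vector can be illustrated numerically. In the sketch below all vectors and frequencies are hypothetical toy values chosen for the example; they are not estimates from any corpus.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))


# Idealized sense vectors and relative sense frequencies (hypothetical).
v_law, v_clothes = [1.0, 0.0], [0.0, 1.0]
f_law, f_clothes = 0.8, 0.2

# The observed vector for "suit" is the frequency-weighted sum of its senses.
v_suit = [f_law * x + f_clothes * y for x, y in zip(v_law, v_clothes)]

v_garment = [0.1, 0.9]   # a neighbor of the clothes sense (hypothetical)
v_lawsuit = [0.9, 0.1]   # a neighbor of the law sense (hypothetical)

# By distributivity of the dot product, the similarity to the composite
# vector is a weighted combination of the per-sense dot products.
composite = (f_law * dot(v_law, v_garment)
             + f_clothes * dot(v_clothes, v_garment)) / (norm(v_suit) * norm(v_garment))

sim_garment = cosine(v_suit, v_garment)
sim_lawsuit = cosine(v_suit, v_lawsuit)
```

With these values the law-related neighbor scores far higher than the clothes-related one, so a small k would return only law-related terms even though the clothes sense is encoded in the same vector.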

Another problem with setting a global k in distributional semantic models
is that some terms will have a much denser neighborhood than others. Using
the same k for all terms therefore seems ill-advised; terms with a dense
neighborhood warrant a larger k than those with a sparse neighborhood. As
we have already touched upon in the introduction, determining k is a
fundamental question in k-NN, for which there seems to be no clear
solution. A more informed approach compared to setting a global k would be
to consider the distribution of distances/similarities and attempt to find
a gap in the distribution at which to cut off the list. However, the
distribution of similarities in distributional semantic models typically
does not have any clear gaps, as exemplified in Figure 2.

³ Weighting schemes muddle this notion quite a bit, but we think the
general intuition still holds.

⁴ In the case of cosine similarity this follows nicely from the
distributive property of dot products: for $v = a v_1 + b v_2$,
$s(v, w) = \frac{v \cdot w}{|v||w|} = \frac{a (v_1 \cdot w) + b (v_2 \cdot w)}{|v||w|}$.
Figure 2: Distribution of similarities for the 100
nearest neighbors of the word "suit" in the three
distributional semantic models used in this paper.



Figure 3: Boxplots of the similarities of 1000 randomly
picked word pairs (right), and of the mean
similarities to the 10 nearest neighbors for 1000
randomly chosen words (left) in the three
distributional semantic models used in this paper.


Note that the curves behave approximately the same in all three models;
there are a few (one or two) very close neighbors, and then the
similarities decrease very slowly. The difference in magnitude of the
similarities between the models is not peculiar to the word "suit". On the
contrary, PMI, GloVe, and the skipgram model produce vector spaces with
different inherent densities. Figure 3 shows both the similarities to 1000
randomly selected points, as well as the similarities to the 10 nearest
neighbors to 1000 randomly selected points. The skipgram model produces the
highest similarity scores both for related and unrelated points, while the
PMI model produces the lowest scores for both related and unrelated points.
The GloVe model is in between. All models show a more or less clear
distinction between the average similarities to randomly chosen points and
the average similarities to the nearest neighbors. This distinction
suggests that it might be possible to use the expected similarity to a
randomly selected point as a cut-off threshold for k-NN. However, such a
global estimate will not be suitable for all terms, for the very same
reason alluded to above: different terms have different densities of their
neighborhoods. Furthermore, it seems as if the PMI model distinguishes more
clearly between the related and the unrelated points, with the skipgram
model having the most outliers. This suggests a global estimate might be
more useful in some types of models (like the standard PMI model) than in
others (like the skipgram model).
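The global cut-off discussed above could be estimated along the following lines. This is only a sketch under stated assumptions: the threshold is taken to be the mean cosine similarity of randomly sampled word pairs, and the vectors, the function names `random_pair_threshold` and `neighbors_above`, and the sample size are hypothetical.

```python
import math
import random


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def random_pair_threshold(vectors, n_pairs=1000, seed=0):
    """Estimate the expected similarity between two randomly chosen words."""
    rng = random.Random(seed)
    words = list(vectors)
    sims = []
    for _ in range(n_pairs):
        a, b = rng.sample(words, 2)
        sims.append(cosine(vectors[a], vectors[b]))
    return sum(sims) / len(sims)


def neighbors_above(vectors, target, threshold):
    """All words more similar to target than the threshold, most similar first."""
    scored = [(cosine(vectors[target], vec), word)
              for word, vec in vectors.items() if word != target]
    return [word for score, word in sorted(scored, reverse=True) if score > threshold]


# Toy vectors (hypothetical, not taken from any of the three models).
vectors = {
    "suit": [1.0, 1.0, 0.0],
    "lawsuit": [1.0, 0.2, 0.0],
    "dress": [0.1, 1.0, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
threshold = random_pair_threshold(vectors, n_pairs=200)
neighbors = neighbors_above(vectors, "suit", threshold)
```

Because the threshold is shared by all terms, it inherits the problem noted above: a term in a dense region will pass many neighbors and a term in a sparse region few, regardless of how those neighbors are structured.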

3 Word-sense Induction 

Selecting a relevant k for a given term and grouping the neighbors
according to which senses they represent is an example of word-sense
induction. Distributional semantic models are well suited for this task,
and there have been a number of different approaches suggested in the
literature, which can roughly be divided into context clustering and word
clustering approaches. Context clustering does not operate on nearest
neighbor lists, but instead clusters representations of each individual
occurrence of a term. (Schütze, 1998) is one of the earliest examples of a
context clustering approach, in which context vectors (the centroid of the
distributional vectors of the terms that occur in the context) for a given
term are clustered into a set of sense vectors that represent the induced
senses. Other examples of context clustering include (Purandare and
Pedersen, 2004), (Velldal, 2005), (Reisinger and Mooney, 2010), (Pedersen,
2010), and (Jurgens and Stevens, 2010).


In contrast to context clustering, word clustering clusters the nearest
neighbors into sense groups, and is thus the type of approach that is most
relevant for our purposes. The earliest example of a word clustering
approach is distributional clustering (Pereira et al., 1993), which
clusters nouns that occur as heads of direct objects of verbs according to
their distributional similarity. The resulting noun clusters for a verb can
be interpreted as a representation of the different senses of the verb.

Another example of a word clustering approach is clustering by committee
(Pantel and Lin, 2002), which is a distributional clustering procedure in
several steps. The first step is to use average-link clustering to
recursively cluster the nearest neighbors into a set of clusters called
committees. The committees are then used to define clusters by iteratively
adding committees whose similarity to the term exceeds a certain threshold,
and that are not too similar to any other added committee. For each added
committee, its features are also removed from the distributional
representation of the lexeme. This last step ensures that the clusters do
not become too similar, and that clusters representing less frequent senses
can be discovered.

The idea of iteratively removing features from the distributional vector
when a sense cluster has been formed is also present in (Dorow and Widdows,
2003), who use a graph-based clustering method (Markov clustering (van
Dongen, 2000)) to cluster the nearest distributional neighbors of a lexeme.
Another graph-based approach to word-sense induction is the HyperLex
algorithm (Véronis, 2004), which constructs a graph connecting all pairs of
terms that co-occur in the context of an ambiguous term. The resulting
graph contains highly connected components (hubs), which represent the
different senses of the term. (Agirre et al., 2006) compares HyperLex to
PageRank (Brin and Page, 1998) and demonstrates that the two methods
perform similarly on a word-sense induction task. Other examples of
graph-based approaches include (Biemann, 2006), (Klapaftis and Manandhar,
2008), and (Marco and Navigli, 2013).


There have also been several attempts to use various types of matrix
factorization to perform word-sense induction. The idea is that the
factorization uncovers a set of global senses in the form of the latent
factors, and that the sense distribution for a given term can be described
as a distribution over these latent factors. (Brody and Lapata, 2009),
(Séaghdha and Korhonen, 2011), (Yao and Van Durme, 2011), and (Lau et al.,
2012) use different versions of Latent Dirichlet Allocation to produce the
factorization, while (Dinu and Lapata, 2010) and (Van de Cruys and
Apidianaki, 2011) instead experiment with non-negative matrix
factorization.


(Tomuro et al., 2007) argues that clustering approaches like distributional
clustering or clustering by committee may produce clusters that are
themselves polysemous, which may not be a desirable property of a
word-sense induction algorithm. As a solution to this problem, Tomuro et
al. suggest using feature domain similarity, which refers to the similarity
between the features of items rather than the similarity between the items
themselves. The feature domain similarity score is incorporated in a
modified version of the clustering by committee algorithm, in which the
algorithm is run twice, using the output of the first run as input to the
second run. The idea is that this iterative approach may enable the
algorithm to utilize higher-order features, and that this will inhibit the
formation of polysemous clusters, since the feature domain similarity of a
polysemous cluster will be lower than the score for a monosemous cluster.


(Koptjevskaja Tamm and Sahlgren, 2014) also leverage the idea of using
feature similarity as the basis of sense clustering. The approach, called
syntagmatically labeled partitioning, relies on a distributional semantic
model that encodes sequential as well as substitutable relations. The
method essentially sorts the k nearest (substitutable) neighbors according
to which sequential connections they share. The resulting partitioning of
the nearest distributional neighbors not only constitutes a word-sense
induction, but also provides labels for the induced senses in the form of
the sequential connections the neighbors share.


4 Neighborhood Graphs 


Many of the approaches to word-sense induction mentioned in the previous
section operate at a global level, utilizing global structural properties
of the semantic spaces, e.g. by matrix factorization techniques. We believe
this is as ill-advised as setting a global k or radius for the nearest
neighbor search, since it is the local structures that are important when
analyzing nearest neighbors. Other approaches to word-sense induction use
various forms of clustering techniques. However, previous studies of the
intrinsic dimensionality of distributional semantic spaces using fractal
dimensions indicate that neighborhoods in semantic space have a filamentary
rather than clustered structure (Karlgren et al., 2008).


We therefore propose the use of topological models that take the local
structure of neighborhoods in semantic space into account. The method
proposed here performs no global clustering, does not concern itself with
grammatical preprocessing or parsing, and the distributional vectors are
taken as is. The approach discovers different word senses from the local
structure of neighborhoods, given nothing but similarities between points.
As such it is easy to test on widely different vector models, as long as
there exists a well-behaved similarity function. The proposed approach not
only answers the question of which other terms are similar to a given term,
but also how they are similar.

Relative neighborhood graphs are examples of empty region graphs (Cardinal
et al., 2009), where points are neighbors if some region between them is
empty. For relative neighborhood graphs the region between two points a and
c belonging to some set of points V is defined as the intersection of the
two spheres with centers in a and c, with radius d(a, c). In other words, a
point b lies between points a and c if it is closer to both a and c than a
and c are to each other, and if no such point b exists, a and c are
neighbors. Illustrations of this can be seen in Figure 4.

Figure 4: Example of when point b is between points a and c (left), and
when it is not (right).


























































Such neighborhoods have been argued to better preserve local topology
(Bremer et al.) and be more robust to deformations of the data than k-NN
neighborhoods (Correa and Lindstrom, 2012), as they in some sense contain
information about direction, whereas k-NN neighborhoods only contain
information about distance. Going back to the "suit" example, we can see
that if "suit" in the law sense is more similar to the composite "suit"
than to its clothes sense, and vice versa, then the composite $v_{suit}$
lies between $v_{suit \mid law}$ and $v_{suit \mid clothes}$. This in turn
means that both of these points are relative neighbors of "suit", and
neither of them lies between the other and "suit".

Formally, the set of points between two points a, c ∈ V can be
characterized and computed in the following way:

$\mathrm{betweens}(V, a, c) = \{ b \mid b \in V,\ b \text{ lies between } a \text{ and } c \}$

$\text{rng-nbh}(V, a) = \{ c \mid c \in V,\ \mathrm{betweens}(V, a, c) = \emptyset \}$

$E_{rng}(V) = \{ (a, b) \mid a \in V,\ b \in \text{rng-nbh}(V, a) \}$

where $E_{rng}$ is the undirected edge set of the RNG. The function
betweens(V, a, c) can be straightforwardly translated to an algorithm
taking O(|V|) time, making the rng-nbh function take O(|V|^2) time, which
in turn makes the computation of the complete graph take O(|V|^3) time.⁵
This is clearly infeasible for large vocabularies, but we have not found
any alternatives that perform better in the high-dimensional case.⁶
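The three definitions translate almost directly into code. The following is a minimal sketch, assuming Euclidean distance (`math.dist`) as the distance function d and a toy set of points; the Python names mirror the formal definitions but are otherwise illustrative.

```python
import math


def lies_between(b, a, c, d=math.dist):
    """b lies between a and c if it is closer to both than they are to each other."""
    return d(a, b) < d(a, c) and d(b, c) < d(a, c)


def betweens(V, a, c, d=math.dist):
    """All points of V between a and c: one pass over V, i.e. O(|V|)."""
    return [b for b in V if b not in (a, c) and lies_between(b, a, c, d)]


def rng_nbh(V, a, d=math.dist):
    """Relative neighbors of a: points c with nothing between a and c, O(|V|^2)."""
    return [c for c in V if c != a and not betweens(V, a, c, d)]


def rng_edges(V, d=math.dist):
    """Undirected edge set of the full relative neighborhood graph, O(|V|^3)."""
    return {frozenset((a, c)) for a in V for c in rng_nbh(V, a, d)}


# Toy points (hypothetical): three on a line and one off to the side.
V = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
neighbors_of_origin = rng_nbh(V, (0.0, 0.0))
edges = rng_edges(V)
```

Here the origin keeps (1, 0) and (0, 3) as relative neighbors, while (2, 0) is excluded because (1, 0) lies between it and the origin, even though (2, 0) is closer to the origin than (0, 3) is. For a similarity function such as cosine, "closer" would be replaced by "more similar", for instance by passing a distance such as 1 minus the similarity as d.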

In (Correa and Lindstrom, 2012) it is noted that the intersection of the
relative neighborhood graph and the k-NN graph is a more feasible
alternative:

$\text{k-rng-nbh}(V, a) = \text{rng-nbh}(V', a)$

where V' = the k nearest neighbors of a

Given a precompiled (i.e. constant-time) k-NN lookup, the above takes
O(k^2) time, so using a heap-based O(|V| lg k) k-NN algorithm results in an
algorithm taking O(k^2 + |V| lg k) time.
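A sketch of the restricted variant is given below, again assuming Euclidean distance and toy points; `heapq.nsmallest` stands in for the heap-based k-NN step, and the function name `k_rng_nbh` is illustrative.

```python
import heapq
import math


def k_rng_nbh(V, a, k, d=math.dist):
    """Relative neighbors of a among its k nearest neighbors."""
    # Heap-based k-NN over all of V.
    nearest = heapq.nsmallest(k, (p for p in V if p != a), key=lambda p: d(a, p))
    # Pairwise check within the k candidates only, i.e. O(k^2).
    return [c for c in nearest
            if not any(d(a, b) < d(a, c) and d(b, c) < d(a, c)
                       for b in nearest if b != c)]


# Toy points (hypothetical).
V = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
two_nn = k_rng_nbh(V, (0.0, 0.0), k=2)
three_nn = k_rng_nbh(V, (0.0, 0.0), k=3)
```

Any point b that lies between a and c is by definition closer to a than c is, so, barring ties, b is among the k nearest neighbors whenever c is. Restricting the search for in-between points to the k candidates therefore loses nothing for those candidates.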

The same idea can be used to build a tree structure, here called a relative
neighborhood tree, rooted in a reference word a, in the following way:

$\text{rnbh-tree}(V, a) = \{ (c, \arg\min_{b \in B_c} d(b, c)) \mid c \in V \}$

where $B_c = \{a\} \cup \mathrm{betweens}(V, a, c)$

This can easily be restricted to the k nearest neighbors of a in much the
same way as above.

⁵ Assuming a constant time distance function.

⁶ It should be noted that there are more efficient algorithms for lower
dimensional situations.

Computing this for a point a produces a tree where the direct children of a
are its relative neighbors, and the parent of a point c further down the
tree is the point between a and c that is closest to c. Figure 5
illustrates what the resulting structure looks like for "heart" on its 100
nearest neighbors in the PMI model. Note that the root "heart" (at the
mid-left in the graph) only has two relative neighbors: "cardiac" and
"soul." One advantage of using this type of structure for the neighborhood
is that it enables us to examine various depths of the tree. Depth one
includes only the direct neighbors ("cardiac" and "soul"); depth two
includes all neighbors two steps away in the graph: "disease," "coronary,"
"pulmonary," "cardiovascular," "ventricular," and "failure," which are all
children of "cardiac;" depth three also includes the neighbors "kidney,"
"severe," "complications," and "diseases" as children of "disease,"
"atrial" and "arrhythmias" as children of "ventricular," and the neighbors
"respiratory," "lung," "tumors," and "aortic" as children of "pulmonary."
This tree structure can be used to identify neighbors that are themselves
polysemous (cf. the critique mentioned in Section 3 of clustering-based
approaches to word-sense induction that they may produce polysemous
clusters (Tomuro et al., 2007)). One example is the neighbor "disease" at
depth two, which has six children that refer to different aspects of
disease.
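The tree construction can be sketched as a parent map. The example below assumes Euclidean distance and toy points, and it omits the root's own entry (which the formal definition would map to itself); the function name `rnbh_tree` follows the definition but the code is illustrative only.

```python
import math


def rnbh_tree(V, a, d=math.dist):
    """Map every point c != a to its parent in the tree rooted at a."""
    parent = {}
    for c in V:
        if c == a:
            continue  # the root has no parent in this sketch
        # B_c: the root itself plus every point lying between a and c.
        candidates = [a] + [b for b in V
                            if b not in (a, c)
                            and d(a, b) < d(a, c) and d(b, c) < d(a, c)]
        # The parent is the candidate closest to c.
        parent[c] = min(candidates, key=lambda b: d(b, c))
    return parent


# Toy points (hypothetical).
V = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (0.0, 3.0)]
parent = rnbh_tree(V, (0.0, 0.0))
```

Any point between a and c is closer to c than a is, so the root is chosen as parent only when nothing lies between a and c, which is exactly the case where c is a relative neighbor of a.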



Figure 5: Relative neighborhood tree for "heart" in
the PMI model, restricted to the 100 closest
neighbors.


This tree structure is thus quite useful in the context of word-sense
induction, since the branching structure indicates different usages, and
the depth factor enables us to calibrate the granularity of the induced
word senses. If we only consider direct neighbors (i.e. depth one), and set
k = |V| (i.e. we do an exhaustive nearest neighbor search), we will extract
all terms that have a direct connection to the reference term. We refer to
this neighborhood as the semantic horizon. At the most coarse level of
analysis, this is the neighborhood that represents the main induced senses
of a term. Tables 2 and 3 provide examples of 1000-RNG neighborhoods.
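One possible way to read induced senses off such a tree is to form one group per direct neighbor of the root and to let the depth limit control how much of each branch is included. The sketch below is an assumption about how this could be done, not a procedure specified in the paper; the parent map is a small fragment of the "heart" tree described in the text, and the function names are hypothetical.

```python
from collections import defaultdict


def children_of(parent):
    """Invert a child -> parent map into a parent -> children map."""
    children = defaultdict(list)
    for node, par in parent.items():
        children[par].append(node)
    return children


def sense_groups(parent, root, max_depth):
    """One group per direct neighbor of root, with its descendants down to max_depth."""
    children = children_of(parent)
    groups = {}
    for head in children[root]:
        group, frontier = [head], [head]
        for _ in range(max_depth - 1):
            frontier = [n for node in frontier for n in children[node]]
            group.extend(frontier)
        groups[head] = group
    return groups


# Fragment of the "heart" tree from Figure 5, written as child -> parent.
parent = {
    "cardiac": "heart",
    "soul": "heart",
    "disease": "cardiac",
    "coronary": "cardiac",
    "kidney": "disease",
}
horizon = sense_groups(parent, "heart", max_depth=1)
deeper = sense_groups(parent, "heart", max_depth=3)
```

At depth one the groups contain only the direct neighbors, which corresponds to the semantic horizon when the tree is built over the full vocabulary; larger depths give a finer-grained picture of each branch.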

These examples demonstrate some interesting 









Table 2: Relative neighborhood of the words 
“suit,” ’’orange,” and ’’heart” in three different se¬ 
mantic models. The numbers in parenthesis indi¬ 
cate the /c-NN ranks of the neighbors. 


PMI 

GloVe 

skipgram 

suit 

suits (1) 
dress (2) 
lawsuit (10) 
dinosaur (53) 
costly (60) 
option (76) 
counterparts (99) 
predator (107) 
trump (109) 

suits (1) 
lawsuit (2) 
mobile (33) 
gundam (34) 
trump (55) 
zoot (133) 
rebid (423) 
serenaders (458) 
hev (987) 

suits (1) 
lawsuit (2) 

orange 

yellow (1) 
lemon (16) 

yellow (1) 
ktype (12) 
lemon (14) 
citrus (17) 
jersey (21) 
cherry (24) 
county (26) 
peel (42) 
jumpsuits (57) 

redorange (1) 

heart 

cardiac (1) 
soul (22) 
hearts (183) 
ashtray(641) 
rags(771) 

my (1) 
blood (2) 
throbs (3) 
suffering (4) 
brain (6) 
cardiac (8) 
hearts (11) 
throb (17) 
lungs (22) 

congestive (1) 
hearts (2) 


similarities and differences between the three 
models. First of all, some direct neighbors are 
present in all three models: "suit" has "suits" 
and "lawsuit" as direct neighbors in all three 
models, "heart" has "hearts," "service" has 
"services," and "above" has "below." Plural forms 
are of course reasonable neighbors of their 
singular counterparts in a semantic model, but 
their usefulness for word-sense induction can 
perhaps be questioned. Taking "suits" to indicate 
the garment-related sense of "suit," all three 
models produce both a garment-related and a 
law-related sense. For "orange," the skipgram 
model only represents the color sense, while the 
PMI and GloVe models also feature a fruit sense. 
For "heart," all three models have a disease sense 
(represented by the neighbor "cardiac" in the PMI 
and GloVe models, and the neighbor "congestive" in 
the skipgram model), and an organ sense 
(represented by the plural form "hearts"). 
"Service" is a comparatively vague term that has a 
number of different senses in the PMI and GloVe 
models, but only one in the skipgram model. "Bad" 
produces both a negativity sense and a German spa 
town sense in all three models, but only the GloVe 
and skipgram models have a separate antonym sense 
("good" is not a direct neighbor in the PMI 
model). "Above" has both the antonym and direct 
neighbors relating to measurements in all three 
models. 
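
To make the notion of a direct neighbor in these 
tables concrete, the following is a minimal sketch 
rather than the implementation used in the 
experiments. It assumes the standard relative 
neighborhood criterion (two points are connected 
unless some third point is closer to both of them 
than they are to each other), applied only among 
the k nearest neighbors of the query. The 
Euclidean distance, the function name 
direct_neighbors, and the toy data are assumptions 
made for illustration. 

```python
import numpy as np


def direct_neighbors(vectors, r, k):
    """Return (index, k-NN rank) pairs for the direct neighbors of point r.

    Assumption: a candidate p is a direct neighbor of r unless another
    candidate q is closer to both r and p than r and p are to each other.
    Only the k nearest neighbors of r are considered as candidates.
    """
    # Euclidean distance from the query to every point (an assumption).
    dist_r = np.linalg.norm(vectors - vectors[r], axis=1)
    order = np.argsort(dist_r)
    candidates = [int(i) for i in order if i != r][:k]

    direct = []
    for rank, p in enumerate(candidates, start=1):
        blocked = False
        for q in candidates:
            if q == p:
                continue
            d_pq = np.linalg.norm(vectors[p] - vectors[q])
            # q lies "between" r and p if it is closer to both of them.
            if max(dist_r[q], d_pq) < dist_r[p]:
                blocked = True
                break
        if not blocked:
            direct.append((p, rank))
    return direct


# Toy data: the point at (2, 0) is hidden behind the point at (1, 0).
toy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [0.0, 1.5]])
print(direct_neighbors(toy, r=0, k=3))
```

In the toy data, the farthest point is not a direct 
neighbor of the query because a closer point lies 
between them, while the point in the other 
direction is kept even though it is only the second 
nearest neighbor. This mirrors how the tables can 
list direct neighbors with high k-NN ranks. 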

GloVe produces the most branched neighborhoods, 
with a large number of direct neighbors, while the 
skipgram model produces the least branched 
neighborhoods, with at most a couple of direct 
neighbors for each term. The PMI model is 
somewhere in between. One reason why GloVe 
produces such branching neighborhoods is that 
GloVe seems to capture not only semantic relations 
but also a significant amount of sequential 
relations. Many of the neighbors in the k-RNG for 
GloVe come from collocations: the k-RNG for "suit" 
includes "mobile" and "gundam," which come from 
the collocation "mobile suit gundam" (an anime 
series), "trump," which relates to "trump suits" 
in card games, "serenaders," which refers to the 
retro string band "cheap suit serenaders," and the 
very distant neighbor "hev," which comes from "hev 
suit" and relates to the Half-Life series of 
first-person shooter games. For "orange" we find 
"ktype," which comes from the collocation "ktype 
stars," another term for "orange dwarfs," as well 
as the collocations "orange peel," "orange 
county," "orange jumpsuit," "cherry orange," and 
so on. The k-RNGs for "heart," "service," "bad," 
and "above" also feature a number of collocations 
for the GloVe model. There are also some examples 
of neighbors from sequential relations in the PMI 
model (e.g. "costly" as a neighbor to "suit" from 
the collocation "costly suit," and "luck" and 
"donnersbergkreis" as neighbors to "bad" from the 
collocations "bad luck" and "bad 
donnersbergkreis"), but this tendency is not 
nearly as pronounced as it is for the GloVe model. 

The PMI and GloVe models produce the structurally 
most similar RNGs, with on average a handful of 
direct neighbors, of which some can be very 
distant. The skipgram model, on the other hand, 
produces very few direct neighbors. This led us to 
look further into the structural properties of 
neighborhoods in the skipgram model. An 
interesting observation, and a possible 
complication, is that the neighborhoods in the 
skipgram model are highly asymmetric: the first 
neighbor of "information" is "informations," 
whereas "information" is only the 1829th neighbor 
of "informations." While such asymmetry occurs in 
all models, it seems much more prevalent in the 
skipgram model. Figure 6 demonstrates this: each 
point corresponds to a random word pair (a, b), 
with x corresponding to where b is in the ordered 
list of a's neighbors, and y to where a is in the 
ordered list of b's neighbors, or equivalently: x 
is the number of points within d(a, b) of a and y 
is the number of points within d(a, b) of b. This 
implies that the local densities vary much more in 
the skipgram 















Table 3: Relative neighbors to the words "service," 
"bad," and "above" in three different semantic 
models. The numbers in parentheses indicate the 
k-NN ranks of the neighbors. 


          PMI                     GloVe               skipgram

service   services (1)            services (1)        services (1)
          network (2)             operated (3)
          operates (8)            serving (6)
          launched (18)           military (17)
          served (22)             duty (20)
          intercity (34)          passenger (21)
                                  dialaride (644)
                                  aftersales (759)
                                  limitedstop (802)

bad       terrible (1)            good (1)            nauheim (1)
          that (2)                kissingen (2)       good (2)
          luck (39)               ugly (45)           dreadful (5)
          unfortunate (70)        nasty (48)
          stalling (276)          dirty (106)
          donnersbergkreis (860)  omen (328)
          rancid (980)            conkers (360)
                                  karma (952)

above     below (1)               below (1)           below (1)
          around (2)              level (2)           500ft (2)
          feet (5)                height (3)
          measuring (29)          just (4)
          beneath (36)            stands (10)
          columns (62)            lower (11)
          atop (102)              beneath (12)
                                  rise (21)
                                  sea (30)


model than in the others, which can complicate 
the choice of k in the k-RNG algorithm. 
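
The rank pair (x, y) plotted in the reciprocity 
figure follows directly from the counting 
definition given above. The following is a minimal 
sketch, assuming Euclidean distance and a small 
made-up set of points. The function name 
reciprocal_ranks is hypothetical, and the distance 
measure used in the experiments may differ. 

```python
import numpy as np


def reciprocal_ranks(vectors, a, b):
    """Return (x, y) for the pair (a, b).

    x is the number of points within d(a, b) of a (the rank of b among
    a's neighbors); y is the number of points within d(a, b) of b (the
    rank of a among b's neighbors). Euclidean distance is an assumption.
    """
    dist_a = np.linalg.norm(vectors - vectors[a], axis=1)
    dist_b = np.linalg.norm(vectors - vectors[b], axis=1)
    # Subtract 1 so that the reference point itself is not counted.
    x = int(np.sum(dist_a <= dist_a[b])) - 1
    y = int(np.sum(dist_b <= dist_b[a])) - 1
    return x, y


# Toy data: point 0 is isolated, point 1 sits at the edge of a dense cluster.
points = np.array([[0.0], [1.0], [1.2], [1.4], [1.6]])
print(reciprocal_ranks(points, 0, 1))
```

In this toy example the isolated point has the 
cluster point as its first neighbor, while the 
cluster point has three closer neighbors of its 
own. This is the kind of asymmetry that arises when 
local density differs between the two ends of a 
pair. 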



Figure 6: Neighborhood reciprocity in the different 
models. 

5 Conclusions 

In this paper we have discussed the question of how 
to query semantic models, a question that has long 
been neglected in research on computational 
semantics. Nearest neighbor search (or k-NN) is 
often treated as the only available option, which 
leads to misunderstandings regarding how semantic 
models represent and handle vagueness and 
polysemy. We have argued that the structure, or 
topology, of the local neighborhoods in semantic 
models carries useful semantic information 
regarding the different usages, or senses, of a 
term, and that such topological properties thus 
can be used to analyze polysemy and to perform 
word-sense induction. We have also argued that the 
topology of the local neighborhoods in semantic 
models can be used for selecting a relevant set of 
neighbors, a factor we have referred to as the 
semantic horizon. 

We have introduced relative neighborhood graphs 
(RNG) as an alternative to standard k-NN, and we 
have illustrated how k-RNG can be used as a tool 
for analyzing the topology of local neighborhoods 
in semantic models. We have exemplified relative 
neighborhoods in three different well-known 
semantic models: the standard PMI model, as well 
as the more recent GloVe and skipgram models. The 
examples provided in this paper demonstrate that 
k-RNG can be used for word-sense induction, but 
that such topological methods are more suitable 
for certain types of semantic models. The k-RNGs 
for the PMI and GloVe models produce informative 
results, while the skipgram model, with its large 
local variations in density, produces less 
informative results. It is quite possible that the 
complete RNG overcomes these problems, but that 
does not seem to be a feasible solution. 

This illustrates how k-RNG can be used as a tool to 
gain insight into the topological properties of 
different models. We have also observed that the 
GloVe model often produces neighbors that 
correspond to various collocations, which means 
that this model is not strictly a semantic 
representation, since it confounds substitutable 
and sequential relations. A more sophisticated 
tokenization, taking n-grams into account, might 
alleviate this. The standard PMI model is nowadays 
often overlooked in favor of more recent neural 
network-inspired models, but our results indicate 
that the PMI model has a number of comparatively 
attractive properties that are useful for 
linguistic applications such as word-sense 
induction. 
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